AITopics | scaling law

Asymmetric Scaling Laws from Sparse Features

Sous, John, Winer, Michael

arXiv.org Machine Learning, May-25-2026

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents (one for the overparameterized regime and one for the underparameterized regime), with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.

artificial intelligence, machine learning, regime, (15 more...)

arXiv:2605.23591

Country: North America > United States > Connecticut (0.28)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits

Czech, Eric, Xu, Zhiwei, Elmatad, Yael, Wang, Yixin, Held, William

arXiv.org Machine Learning, Mar-31-2026

Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the $3.8\times10^{25}$ FLOP training budget and \$1.4M (90% CI: \$412K-\$2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry ($\alpha \neq \beta$). Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient, numerically unstable, prone to local minima, and harder to implement. Each concern is shown to be unfounded or addressable, especially when the partially linear structure of the objective is exploited via Variable Projection, enabling unbiased inference on all five loss surface parameters through a two-dimensional optimization that is well-conditioned, analytically differentiable, and amenable to dense, or even exhaustive, grid search. It may serve as a more convenient replacement for Approach 2 or a more scalable alternative for adaptations of Approach 3 to richer scaling law formulations. See https://github.com/Open-Athena/vpnls for details and https://openathena.ai/scaling-law-analysis for other results from this study.
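The variable-projection idea in the abstract above can be illustrated with a small sketch. It assumes the usual five-parameter loss surface L(N, D) = E + A·N^(-α) + B·D^(-β): for any fixed (α, β) the remaining parameters (E, A, B) enter linearly, so they can be solved in closed form, leaving only a two-dimensional search over the exponents. The function names, the grid, and the synthetic constants below are illustrative assumptions; this is not the API of the linked vpnls repository, and a plain squared-error objective is assumed.

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(row) + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x


def fit_linear_part(runs, alpha, beta):
    """For fixed exponents, least-squares solve for (E, A, B); return (sse, E, A, B)."""
    feats = [(1.0, n ** -alpha, d ** -beta) for n, d, _ in runs]
    ys = [loss for _, _, loss in runs]
    # Rescale each column to a maximum of 1 so the normal equations stay well-conditioned.
    scale = [max(f[j] for f in feats) for j in range(3)]
    x = [[f[j] / scale[j] for j in range(3)] for f in feats]
    gram = [[sum(r[i] * r[j] for r in x) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(x, ys)) for i in range(3)]
    e, a, b = (c / s for c, s in zip(solve3(gram, rhs), scale))
    sse = sum((e + a * f[1] + b * f[2] - y) ** 2 for f, y in zip(feats, ys))
    return sse, e, a, b


# Synthetic, noise-free runs built from illustrative constants (not fitted to real data).
TRUE = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
runs = [
    (n, d, TRUE["E"] + TRUE["A"] * n ** -TRUE["alpha"] + TRUE["B"] * d ** -TRUE["beta"])
    for n in (1e7, 3e7, 1e8, 3e8, 1e9)       # parameter counts
    for d in (1e9, 3e9, 1e10, 3e10, 1e11)    # training tokens
]

# Exhaustive 2-D grid over the exponents; the linear part is solved at every grid point.
grid = [i / 100 for i in range(20, 51, 2)]
sse, E, A, B, alpha, beta = min(
    (fit_linear_part(runs, al, be) + (al, be) for al in grid for be in grid),
    key=lambda t: t[0],
)
print(f"alpha={alpha:.2f} beta={beta:.2f} E={E:.3f} A={A:.1f} B={B:.1f}")
```

Because the true exponents lie on the grid and the data are noise-free, the search recovers all five constants here; with real, noisy runs one would likely refine the best grid point with a local optimizer and a robust loss.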

large language model, machine learning, natural language, (17 more...)

arXiv:2603.22339

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Neural Information Processing Systems, Mar-21-2026, 21:53:26 GMT

Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, Slim-pajama) and the downstream domain corpus. Existing methods usually rely on laborious grid searches over a set of mixture ratios, which incur high GPU training costs. Moreover, the selected ratio is not guaranteed to be optimal for the specific domain. To address these limitations, and inspired by the use of scaling laws for performance prediction, we investigate the scaling law of Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training costs on limited experiments. We also extend the standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where only very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of the proposed D-CPT Law and Cross-Domain D-CPT Law.

artificial intelligence, large language model, natural language, (15 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Observational Scaling Laws and the Predictability of Language Model Performance

Neural Information Processing Systems, Mar-18-2026, 18:06:18 GMT

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.

artificial intelligence, natural language, proceedings, (5 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

a4628e9fbd3002a554923642f74d5d6b-Paper-Conference.pdf

Neural Information Processing Systems, Feb-17-2026, 04:22:52 GMT

large language model, machine learning, natural language, (20 more...)

Country: Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report > Experimental Study (1.00)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

1cded4f97cf5f01a284c574110b7e3b9-Paper-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 23:29:42 GMT

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology (0.45)
Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)

Scaling Laws for Hyperparameter Optimization

Neural Information Processing Systems, Dec-26-2025, 09:07:57 GMT

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization; however, most of them do not exploit the dominant power-law nature of learning curves for Bayesian optimization. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 59 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
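To make the power-law view of learning curves concrete, here is a minimal sketch that fits error ≈ b · step^(-γ) to the first few epochs of two configurations and extrapolates both to a larger budget before deciding which one to keep training. It is not DPL itself, which the abstract describes as a neural network ensemble inside Bayesian optimization; the log-log regression, the absence of an error floor, and the two synthetic curves are simplifying assumptions made for illustration.

```python
import math


def fit_power_law(steps, errors):
    """Fit errors ~ b * steps**(-gamma) by ordinary least squares in log-log space."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (b, gamma)


def extrapolate(b, gamma, step):
    """Predicted error at a later step under the fitted power law."""
    return b * step ** (-gamma)


# Two hypothetical configurations, each observed for only 8 epochs (synthetic curves).
steps = list(range(1, 9))
curves = {
    "config_a": [0.5 * s ** -0.1 for s in steps],  # lower error early, slow improvement
    "config_b": [1.0 * s ** -0.4 for s in steps],  # higher error early, fast improvement
}

leader_now = min(curves, key=lambda name: curves[name][-1])
forecast = {name: extrapolate(*fit_power_law(steps, c), 100) for name, c in curves.items()}
leader_at_100 = min(forecast, key=forecast.get)
print(leader_now, leader_at_100)
```

On these curves the configuration that is ahead after 8 epochs is not the one forecast to be ahead at epoch 100, which is the kind of decision a curve-aware scheduler is meant to get right.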

hyperparameter optimization, name change, scaling law, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.61)

Scaling laws for language encoding models in fMRI

Neural Information Processing Systems, Dec-24-2025, 22:34:02 GMT

Representations from transformer-based unidirectional language models are known to be effective at predicting brain responses to natural language. However, most studies comparing language models to brains have used GPT-2 or similarly sized language models. Here we tested whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar log-linear behavior was observed when scaling the size of the fMRI training set. We also characterized scaling for acoustic encoding models that use HuBERT, WavLM, and Whisper, and we found comparable improvements with model size. A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.
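The log-linear relationship described above (performance growing linearly in the logarithm of model size) can be checked with a simple regression. The sketch below is only an illustration: the parameter counts are chosen to span the 125M to 30B range mentioned in the abstract, while the scores are synthetic values generated from an assumed line, not measurements from the study.

```python
import math


def fit_log_linear(sizes, scores):
    """Ordinary least squares for score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(scores) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, scores)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b


# Hypothetical model sizes (parameters) and synthetic encoding scores (correlations).
sizes = [125e6, 350e6, 1.3e9, 6.7e9, 13e9, 30e9]
TRUE_A, TRUE_B = -0.30, 0.05  # assumed intercept and slope, for illustration only
scores = [TRUE_A + TRUE_B * math.log10(s) for s in sizes]

a, b = fit_log_linear(sizes, scores)
gain_per_doubling = b * math.log10(2)  # score increase each time model size doubles
print(f"intercept={a:.3f} slope_per_decade={b:.3f} gain_per_doubling={gain_per_doubling:.4f}")
```

Under a log-linear law the gain per doubling is constant, so each further improvement of the same size costs twice as many parameters as the last; any noise ceiling would eventually bend the line, which this sketch does not model.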

language model, name change, scaling law, (4 more...)

Industry: Health & Medicine > Health Care Technology (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Neural Information Processing Systems, Dec-24-2025, 13:47:53 GMT

Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for more informed scaling.

compute-optimal model design, name change, scaling law, (6 more...)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models

Nimmaturi, Datta, Bhargava, Vaishnavi, Ghosh, Rajat, George, Johnu, Dutta, Debojyoti

arXiv.org Artificial Intelligence, Dec-2-2025

Fine-tuning large language models (LLMs) for complex reasoning with reinforcement learning (RL) continues to be prohibitively expensive. Through a phenomenological investigation of GRPO post-training dynamics, we identify a scaling law characterized by exponential reward saturation. The emergence of this early plateau motivates an important question: can GRPO be equipped with principled early stopping criteria to significantly reduce post-training compute while preserving downstream performance? Across four open-source models (Llama 3B/8B and Qwen 3B/7B), we perform a systematic empirical study of GRPO fine-tuning and derive scaling laws that accurately predict reward trajectories during training. Our analysis shows that GRPO reward curves are well-approximated by an exponential saturation with three phases that are consistent across all models: (i) slow initial progress, (ii) rapid improvement, and (iii) saturation. We further show that a simple parametric scaling law, conditioned on model size, initial performance, and normalized training progress, reliably predicts the onset of plateauing performance. A key practical finding is that training beyond roughly 80% of a single epoch yields negligible reward gains while consuming a substantial fraction of total computation. Using our scaling law, practitioners can forecast these phase transitions early and select data-driven stopping points, substantially reducing GRPO compute without sacrificing final performance. Our results suggest that such predictive scaling laws are a promising tool for managing GRPO finetuning costs.
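As a rough illustration of how a saturating reward curve can yield a stopping point, the sketch below fits r(t) = r_inf + c · exp(-k t) over normalized training progress t and then solves for the progress at which a chosen fraction of the total gain has been reached. The functional form, the grid over k, and the synthetic trajectory are all assumptions made here; the abstract's law is additionally conditioned on model size and initial performance, and a pure exponential does not capture the slow initial phase it describes.

```python
import math


def fit_saturation(ts, rewards, k_grid):
    """Grid-search the rate k; for each k solve r_inf and c by closed-form least squares."""
    n = len(ts)
    best = None
    for k in k_grid:
        xs = [math.exp(-k * t) for t in ts]
        mx, my = sum(xs) / n, sum(rewards) / n
        c = sum((x - mx) * (y - my) for x, y in zip(xs, rewards)) / sum((x - mx) ** 2 for x in xs)
        r_inf = my - c * mx
        sse = sum((r_inf + c * x - y) ** 2 for x, y in zip(xs, rewards))
        if best is None or sse < best[0]:
            best = (sse, k, r_inf, c)
    return best[1], best[2], best[3]  # (k, r_inf, c)


def progress_for_fraction(k, fraction):
    """Normalized progress at which `fraction` of the eventual reward gain is reached."""
    return -math.log(1.0 - fraction) / k


# Synthetic reward trajectory over one epoch (t in [0, 1]); the constants are illustrative.
ts = [i / 20 for i in range(21)]
rewards = [0.7 - 0.5 * math.exp(-5.0 * t) for t in ts]

k, r_inf, c = fit_saturation(ts, rewards, k_grid=[0.5 * i for i in range(1, 21)])
t_stop = progress_for_fraction(k, 0.98)
print(f"k={k:.1f} plateau={r_inf:.2f} stop_at={t_stop:.2f} of an epoch")
```

In practice the fit would be run on a noisy prefix of the trajectory rather than the full curve, so the estimated stopping point carries uncertainty that this sketch ignores.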

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv:2507.18014

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)